Method and apparatus for generating a telemetric impulsional response fingerprint for a computer system

ABSTRACT

One embodiment of the present invention provides a system for generating telemetric impulsional response fingerprints for an electronic system. The system operates by first determining a steady-state response of the electronic system under specified initial conditions. Next, the system introduces a sudden impulse step change to a parameter of the electronic system and then measures the dynamic response of the electronic system to the sudden impulse step change. The system then generates a multiparametric representation from the steady-state response and the dynamic response wherein the multiparametric representation simultaneously displays the steady-state response and the dynamic response.

RELATED APPLICATION

The subject matter of this application is related to the subject matter in a co-pending non-provisional application by Kenny C. Gross and Lawrence G. Votta Jr. entitled, “Method and Apparatus for Monitoring and Recording Computer System Performance Parameters,” having Ser. No. 10/272,680 and filing date 17 Oct. 2002, which is incorporated herein by reference; and to the subject matter in a co-pending non-provisional application by Kenny C. Gross, Lawrence G. Votta Jr., and Adam Porter entitled, “Detecting and Correcting a Failure Sequence in a Computer System Before a Failure Occurs,” having Ser. No. 10/777,532 and filing date 11 Feb. 2004, which is incorporated herein by reference.

BACKGROUND

1. Field of the Invention

The present invention relates to techniques for enhancing reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for proactively monitoring computer system components for faults by using telemetric impulsional response fingerprints.

2. Related Art

As electronic commerce grows more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.

To achieve high availability in enterprise computing systems it is necessary to be able to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If systems have too little event monitoring, when problems crop up at a customer site, service engineers may be unable to quickly identify the source of the problem. This can lead to increased down time, which can adversely impact customer satisfaction and loyalty.

One approach to address this problem is to monitor all aspects of a customer's data center and to send the monitored signals to a central monitoring center. This enables system administrators at the monitoring center to identify problematic discrepancies in system performance parameters and, if necessary, to direct service personnel to handle discrepancies more efficiently.

Existing continuous telemetry systems perform proactive fault monitoring of computer systems through passive surveillance, which does not impact the monitored system in any way. This approach can catch many types of faults. However, there are other latent faults that may appear only during dynamic stimulation. An analogy of these latent faults is a car that may have a problem with acceleration. The problem may not reveal itself during idling or while cruising at a uniform speed.

Hence, what is needed is a method and an apparatus for proactive fault monitoring of a computer system without the shortcomings described above.

SUMMARY

One embodiment of the present invention provides a system for generating telemetric impulsional response fingerprints for an electronic system. The system operates by first determining a steady-state response of the electronic system under specified initial conditions. Next, the system introduces a sudden impulse step change to a parameter of the electronic system and then measures the dynamic response of the electronic system to the sudden impulse step change. The system then generates a multiparametric representation from the steady-state response and the dynamic response wherein the multiparametric representation simultaneously displays the steady-state response and the dynamic response.

In a variation of this embodiment, determining the steady-state response of the electronic system involves making measurements using a continuous system telemetry harness.

In a further variation, determining the steady-state response of the electronic system involves monitoring temperature, voltage, current, and/or vibration at multiple points within the electronic system.

In a further variation, introducing the sudden impulse step change involves changing a load, a temperature, a voltage, and/or a vibration within the electronic system.

In a further variation, measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.

In a further variation, generating the multiparametric representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.

In a further variation, the system detects incipient problems in the electronic system by comparing the multiparametric response representation with a standard multiparametric representation derived from a known good electronic system.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates an electronic system under test in accordance with an embodiment of the present invention.

FIG. 2 presents a flowchart illustrating the process of generating a multiparametric response representation in accordance with an embodiment of the present invention.

FIG. 3 illustrates a normalized temperature response in time to a step voltage change in accordance with an embodiment of the present invention.

FIG. 4 illustrates a multiparametric response representation for a voltage step change in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The data structures and code described in this detailed description are typically stored on a computer readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs), and computer instruction signals embodied in a transmission medium (with or without a carrier wave upon which the signals are modulated). For example, the transmission medium may include a communications network, such as the Internet.

Overview

Research has shown that a computer field replaceable unit (FRU) can exhibit a wide range of subtle, incipient problems which can be amplified and easily spotted if one examines the dynamic response of the FRU just before and just after a well defined dynamic-stimulus perturbation. The present invention provides a method and apparatus for creating telemetric impulsional response fingerprints (TIRF) which can be used to detect such incipient problems, and to thereby enhance reliability, availability, and serviceability of enterprise computer systems.

The TIRF provides a new and unique “active probe” machine-learning technique that leverages continuous system telemetry to provide dynamic, multivariate “fingerprints” for FRUs that can be (1) compared with previous TIRFs for the same FRU, or (2) compared with TIRFs for “Golden FRUs” generated from FRUs that are certified to be operating nearly perfectly. These fingerprints can be used to recognize very subtle failure precursors, such as aging processes, degrading sensors, delamination of bonded components, solder-joint cracking, deterioration of socket connectors, and other mechanisms that may not show up during conventional ongoing reliability testing (ORT) or reliability quality testing (RQT) test sequences.

A continuous system telemetry harness (CSTH) has been developed (see U.S. patent application Ser. No. 10/272,680 entitled “Method and Apparatus for Monitoring and Recording Computer System Performance Parameters,” filed 17 Oct. 2002). The CSTH monitors temperatures, voltages, and currents throughout a system, as well as some discrete performance metrics extracted from the operating system. The CSTH provides signals that can be used to enhance root cause analysis (RCA) following system failures. The signals can also be monitored in real-time for early warning of the onset of problems.

For the above listed types of reactive and proactive surveillance techniques for FRUs and systems, the telemetry is passive and does not disturb the monitored system in any way.

The present invention leverages the CSTH while extending significantly the range of its diagnostic coverage with a new dynamic probe technique. The new dynamic probe technique described below provides a wealth of diagnostic information relating to the health of components, FRUs, and integrated systems.

The system described herein generates a TIRF for an FRU by:

    (1) introducing a sudden impulse step change in one or more operational parameters (e.g. load, temperature, voltage) associated with the FRU;

    (2) measuring the dynamic response of all monitorable parameters following the impulse; and

    (3) creating a multiparametric Kiviat diagram (also known as a spider plot) that contrasts the post-impulse behavior with the “reference” behavior. The Kiviat diagram provides a “dynamic fingerprint” for the FRU.

The TIRF provides a unique and concise representation of the dynamic response of the FRU to a controlled perturbation under specified initial conditions. The TIRF can be represented as a vector of signal values collected from sensors, arranged in a specific order, and normalized to represent the post-perturbation vs. pre-perturbation behavior as a multivariate “fingerprint” for that FRU. Furthermore, the TIRF can be plotted in Kiviat diagram format as a human visualization aid to very readily highlight exactly where any problems appear. As such, the TIRF provides a dynamic perturbation response signature of a given FRU under unified values of initial conditions. Note that each FRU can have several TIRFs corresponding to different types of perturbations.
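As a minimal sketch of the vector form described above, the following Python fragment arranges sensor readings in a fixed order and normalizes each post-perturbation value by its pre-perturbation value. The ratio normalization, the function name tirf_vector, the sensor names, and the readings are all illustrative assumptions; the description does not prescribe a particular normalization.

```python
from typing import Dict, List


def tirf_vector(pre: Dict[str, float],
                post: Dict[str, float],
                order: List[str]) -> List[float]:
    """Return post/pre ratios for each sensor, in the given fixed order.

    Ratio normalization is an assumption made for this sketch.
    """
    vector = []
    for name in order:
        baseline = pre[name]
        if baseline == 0:
            raise ValueError(f"zero pre-perturbation value for {name}")
        vector.append(post[name] / baseline)
    return vector


# Hypothetical sensor readings before and after a step change.
pre = {"cpu_temp_c": 50.0, "core_voltage_v": 1.2, "fan_current_a": 0.5}
post = {"cpu_temp_c": 55.0, "core_voltage_v": 1.26, "fan_current_a": 0.6}

# The order is fixed so that fingerprints from different runs line up.
order = ["core_voltage_v", "cpu_temp_c", "fan_current_a"]
fingerprint = tirf_vector(pre, post, order)
```

Keeping the order explicit, rather than relying on dictionary iteration, is what lets two fingerprints be compared element by element.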

An FRU TIRF is a very concise multivariate descriptor of a given FRU under specified conditions. Moreover, the collection of TIRFs for FRUs may have great potential to increase availability of complex enterprise servers. Along with standard long-duration online tests of FRUs in ORT and RQT, TIRFs can be generated very quickly to represent important diagnostic information about the FRU's dynamic operability and, in most cases, can be obtained without taking the FRU out of service.

Electronic System Under Test

FIG. 1 illustrates an electronic system under test 102 in accordance with an embodiment of the present invention. During operation, electronic system under test 102 receives soft variables 103, physical variables 104, and canary variables 105. In response to these input variables, electronic system under test 102 produces monitored output 106. Post-measurement operations 108 are performed on monitored output 106 to generate multiparametric response representation 110.

Soft variables 103 can include metrics such as load, throughput, and transaction latencies. These variables are typically derived from the operating system of electronic system under test 102. Physical variables 104 include temperature, voltage, current, and vibration within electronic system under test 102. Canary variables 105 include synthetic user-transactions and quality of performance values for these synthetic transactions.

The testing methodology involves first establishing a steady-state for soft variables 103, physical variables 104, and canary variables 105. After the steady state has been established, the system takes a pre-perturbation snapshot of the system parameters. Next, the system applies a sudden impulse change to one or more of the variables. For example, one or more voltages applied to the system can be changed, or the load applied from the canary variables might be stepped to a maximum value to stress the system.

After the sudden impulse has been applied, the system measures the dynamic response of electronic system under test 102, and takes a post-perturbation snapshot of the system parameters. The system then uses the pre-perturbation and the post-perturbation parameters to generate a multiparametric response representation. This representation can be in the form of a Kiviat diagram.

Generating a Multiparametric Response Representation

FIG. 2 presents a flowchart illustrating the process of generating a multiparametric response representation in accordance with an embodiment of the present invention. The system starts by taking a pre-perturbation snapshot of system parameters after the system has been allowed to reach a steady-state (step 202). Next, the system introduces a sudden impulse step change to one or more of the variables (step 204).

After the sudden impulse step change, the system measures the dynamic response of electronic system under test 102 (step 206). The system then takes a post-perturbation snapshot of the system parameters (step 208). Finally, the system generates a multiparametric response representation (a Kiviat diagram) from the pre-perturbation snapshot and the post-perturbation snapshot (step 210). This Kiviat diagram can then be compared with a previous Kiviat diagram taken from electronic system under test 102 or it can be compared with a Kiviat diagram that was generated from a known good electronic system to determine if electronic system under test 102 has any incipient failures.
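The flow of steps 202 through 210, followed by a comparison against a known good system, might be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed implementation: SimulatedBoard, generate_fingerprint, deviating_parameters, the ratio normalization, and the 0.05 tolerance are all hypothetical, and the numeric fingerprint stands in for the Kiviat diagram.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]


class SimulatedBoard:
    """Hypothetical stand-in for an electronic system under test."""

    def __init__(self, temp_gain: float) -> None:
        self.voltage = 1.0
        self.temp = 40.0
        self.temp_gain = temp_gain  # fractional temperature rise after the step

    def read(self) -> Snapshot:
        return {"voltage": self.voltage, "temp": self.temp}

    def step(self) -> None:
        # Step the voltage; the temperature follows according to temp_gain.
        self.voltage = 1.1
        self.temp = 40.0 * (1.0 + self.temp_gain)


def generate_fingerprint(read: Callable[[], Snapshot],
                         step: Callable[[], None],
                         reads_after_step: int = 3) -> Snapshot:
    pre = read()                                           # step 202
    step()                                                 # step 204
    response = [read() for _ in range(reads_after_step)]   # step 206
    post = response[-1]                                    # step 208
    # Step 210, in numeric form: post/pre ratio per parameter (assumed).
    return {name: post[name] / pre[name] for name in pre}


def deviating_parameters(fingerprint: Snapshot,
                         golden: Snapshot,
                         tolerance: float = 0.05) -> List[str]:
    """Names whose fingerprint value differs from the golden one by more than tolerance."""
    return sorted(name for name in golden
                  if abs(fingerprint[name] - golden[name]) > tolerance)


golden_board = SimulatedBoard(temp_gain=0.05)
suspect_board = SimulatedBoard(temp_gain=0.20)
golden_fp = generate_fingerprint(golden_board.read, golden_board.step)
suspect_fp = generate_fingerprint(suspect_board.read, suspect_board.step)
suspects = deviating_parameters(suspect_fp, golden_fp)
```

In this simulated case both boards see the same voltage step, so only the temperature response separates the suspect board from the known good one.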

Normalized Temperature Response

FIG. 3 illustrates a normalized temperature response in time to a step voltage change in accordance with an embodiment of the present invention. The upper chart in FIG. 3 illustrates a normalized step voltage change in several voltages applied to, for example, a system board within a computer system. The lower chart in FIG. 3 illustrates the normalized temperature change at various points related to the system board in response to these step voltage changes. Note that while useful, these charts, which plot normalized parameter changes with respect to time, can be difficult to interpret.
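One way a normalized time response of this kind could be computed is sketched below in Python. The choice of normalization (fractional change relative to the mean of the samples taken before the step), the function name normalized_response, and the sample temperatures are assumptions made for illustration; the figure itself does not state which normalization was used.

```python
from typing import List


def normalized_response(samples: List[float], step_index: int) -> List[float]:
    """Fractional change of each sample relative to the pre-step mean.

    samples[:step_index] are assumed to be steady-state readings taken
    before the step change was applied.
    """
    if step_index <= 0:
        raise ValueError("need at least one pre-step sample")
    baseline = sum(samples[:step_index]) / step_index
    if baseline == 0:
        raise ValueError("zero baseline")
    return [(s - baseline) / baseline for s in samples]


# Hypothetical temperature trace: three steady readings, then the response.
temps = [40.0, 40.0, 40.0, 41.0, 43.0, 44.0]
curve = normalized_response(temps, step_index=3)
```

Normalizing in this way puts sensors with different units and operating points on a common scale, which is what allows several traces to share one chart.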

Multiparametric Response Representation

FIG. 4 illustrates a multiparametric response representation (Kiviat diagram) of a response to a voltage step change in accordance with an embodiment of the present invention. The Kiviat diagrams in FIG. 4 represent the same inputs and responses illustrated in FIG. 3 above. The upper diagram in FIG. 4 represents the step voltage change, while the lower diagram represents the temperature response to the step voltage change.

The inner and outer polygons in the Kiviat diagrams represent the minimum and maximum values for the monitored parameters. These polygons can be compared with polygons from a previous test of the same electronic system, or can be compared with polygons from a test on a known good system to determine if there exist any incipient failures within the electronic system.
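The construction of the inner and outer polygons can be sketched in Python as follows. Each monitored parameter is given one axis, axes are spaced at equal angles, and the minimum and maximum of each parameter set the radius of the inner and outer polygon on that axis. The function names, the parameter names, and the sample values are hypothetical, and the equal-angle layout is the usual Kiviat convention rather than something stated in the description.

```python
import math
from typing import Dict, List, Tuple


def min_max_per_parameter(series: Dict[str, List[float]],
                          order: List[str]) -> Tuple[List[float], List[float]]:
    """Per-parameter minimum and maximum, in a fixed axis order."""
    inner = [min(series[name]) for name in order]
    outer = [max(series[name]) for name in order]
    return inner, outer


def polygon_vertices(radii: List[float]) -> List[Tuple[float, float]]:
    """Place one vertex per axis, with axes spaced at equal angles."""
    n = len(radii)
    return [(r * math.cos(2 * math.pi * k / n),
             r * math.sin(2 * math.pi * k / n))
            for k, r in enumerate(radii)]


# Hypothetical normalized temperature traces at four points on a board.
series = {
    "t1": [1.0, 1.1, 1.3],
    "t2": [1.0, 1.05, 1.2],
    "t3": [1.0, 0.9, 1.1],
    "t4": [1.0, 1.15, 1.1],
}
order = ["t1", "t2", "t3", "t4"]

inner, outer = min_max_per_parameter(series, order)
inner_polygon = polygon_vertices(inner)
outer_polygon = polygon_vertices(outer)
```

The resulting vertex lists can be handed to any plotting tool; comparing two tests then reduces to comparing the corresponding radii axis by axis.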

The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.

1. A method for generating telemetric impulsional response fingerprints for an electronic system, comprising: determining a steady-state response of the electronic system under specified initial conditions; introducing a sudden impulse step change to a parameter of the electronic system; measuring a dynamic response of the electronic system to the sudden impulse step change; and generating a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
2. The method of claim 1, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
3. The method of claim 1, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
4. The method of claim 1, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
5. The method of claim 1, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
6. The method of claim 1, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
7. The method of claim 1, further comprising detecting incipient problems in the electronic system by comparing the multiparametric representation with a standard multiparametric representation derived from a known good electronic system.
8. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for generating telemetric impulsional response fingerprints for an electronic system, the method comprising: determining a steady-state response of the electronic system under specified initial conditions; introducing a sudden impulse step change to a parameter of the electronic system; measuring a dynamic response of the electronic system to the sudden impulse step change; and generating a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
9. The computer-readable storage medium of claim 8, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
10. The computer-readable storage medium of claim 8, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
11. The computer-readable storage medium of claim 8, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
12. The computer-readable storage medium of claim 8, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
13. The computer-readable storage medium of claim 8, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
14. The computer-readable storage medium of claim 8, the method further comprising detecting incipient problems in the electronic system by comparing the multiparametric representation with a standard multiparametric representation derived from a known good electronic system.
15. An apparatus for generating telemetric impulsional response fingerprints for an electronic system, comprising: a determining mechanism configured to determine a steady-state response of the electronic system under specified initial conditions; a step-change mechanism configured to introduce a sudden impulse step change to a parameter of the electronic system; a measuring mechanism configured to measure a dynamic response of the electronic system to the sudden impulse step change; and a generating mechanism configured to generate a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
16. The apparatus of claim 15, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
17. The apparatus of claim 15, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
18. The apparatus of claim 15, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
19. The apparatus of claim 15, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
20. The apparatus of claim 15, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.